Abstract

Genomic signal processing is a relatively new field in bioinformatics in which signal processing algorithms and methods are used to study functional structures in DNA. An appropriate mapping of a DNA sequence into one or more numerical sequences enables the use of many digital signal processing tools in the analysis of genomic sequences. A novel influenza A (H1N1) virus of swine origin emerged in the spring of 2009 and spread very rapidly among people. The severity of the disease and the number of deaths caused by a pandemic virus vary greatly and can change over time. In this work, pandemic H1N1 genomic sequences were characterized according to nonlinear dynamical features, namely moment invariants and largest Lyapunov exponents, and then compared with the same features extracted from classical H1N1 genomic sequences. The proposed methods were applied to a number of sequences encoded into time series using a coding scheme based on the Electron-Ion Interaction Pseudopotential (EIIP). The aim of this work is to extract genomic features that can distinguish the new swine flu from the classical H1N1 that existed before it. The sequences come from segment 8 of the influenza genome, which consists of 8 RNA segments; segment 8 encodes two proteins (NS1 and NS2) that are important in the virus's attack on the immune system. According to the obtained results, a significance test shows evident variability between the two groups, pandemic and classical H1N1 sequences.

Introduction

The variations of the pandemic H1N1 influenza virus result from different mutations occurring during viral replication. The polymerase of this RNA virus lacks proofreading activity; this gives rise to considerable viral variability, culminating in 3 different types (A, B and C) in addition to many subtypes based on variations in the hemagglutinin (HA) and neuraminidase (NA) surface proteins. The influenza genome consists of 8 RNA segments and encodes 10 polypeptides. The internal structural proteins, the nucleocapsid protein (NP) and the two matrix proteins (M), are used to classify the influenza virus into types A, B and C. The surface proteins neuraminidase (NA) and hemagglutinin (HA) have been studied extensively, and the antigenic variations in these surface glycoproteins are used to subtype influenza A. Additionally, three of the influenza polypeptides are associated with RNA polymerase activity (PA, PB1, PB2), and the RNA-binding non-structural protein (NS) contributes to viral pathogenicity and plays a central role in preventing the interferon-mediated antiviral response. The influenza A virus (IAV) undergoes major and minor genetic variations; the yearly antigenic drift can result in a change as minor as a single amino acid mismatch. Major variations, known as antigenic shifts, are the cause of serious outbreaks and pandemics such as the 1918, 1957 and 1968 worldwide outbreaks. Changes in genetic and antigenic composition pose challenges for the development of influenza vaccines and antiviral medications.

In the last two decades, there has been increasing interest in applying techniques from nonlinear analysis and chaos theory in different fields of research. In this work, chaos theory was applied to both pandemic H1N1 and classical H1N1 genomic sequences in order to discriminate between them according to their nonlinear dynamical features, namely moment invariants and Largest Lyapunov Exponents (LLE).

Materials and Methods 

The conversion of DNA sequences into digital signals offers the possibility of applying signal processing methods to the analysis of genomic data. Genomic signal processing provides an efficient tool for extracting features of DNA sequences that are maintained over whole genomes. In this work, EIIP sequence indicators were used: the energy of delocalized electrons in amino acids and nucleotides has been calculated as the Electron-Ion Interaction Pseudopotential (EIIP). EIIP values of amino acids have been substituted for the corresponding amino acids in protein sequences, and the power spectrum of the resulting numerical sequence taken to extract its information content. To study the dynamics of the proposed system, the state-space trajectory was first reconstructed. Phase-space reconstruction is fundamental to the analysis of nonlinear signals; through it, a time series can be embedded in an n-dimensional space.
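
As an illustration of the encoding step, the sketch below maps a nucleotide string to a numerical series. The four EIIP values are the ones commonly tabulated for nucleotides and are an assumption here, since the text does not list them; the function name `eiip_encode` and the example fragment are hypothetical.

```python
# EIIP values commonly tabulated for DNA nucleotides.
# Assumption: the text above does not list the values it used.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}


def eiip_encode(sequence):
    """Map a nucleotide string to a list of EIIP values (a numeric series)."""
    signal = []
    for base in sequence.upper():
        if base == "U":  # assumption: treat RNA uracil like thymine
            base = "T"
        if base not in EIIP:
            raise ValueError(f"unsupported symbol: {base!r}")
        signal.append(EIIP[base])
    return signal


# Hypothetical fragment, for illustration only (not taken from the data set).
example = eiip_encode("ATGGATTCC")
```

Each position of the sequence becomes one sample of the signal, so an 800-1000 bp sequence yields a series of 800-1000 samples.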



The basic steps of the phase-space reconstruction are briefly as follows. First, sequences of the pandemic H1N1 and of the classical H1N1 that existed before it were encoded into time series signals using EIIP sequence indicators. The delay time was chosen as the first minimum of the auto mutual information function, which was found at a delay of four. The minimal embedding dimension for the pandemic and classical H1N1 time series signals was then calculated using Cao's method with a delay time of four, a maximal dimension of eight, three nearest neighbors and a number of reference points depending on the length of each signal. Cao's method produced a kink at 3, so the time-delay reconstruction of the pandemic and classical H1N1 time series signals used an embedding dimension of 3 and a delay of 4. Finally, the phase-space trajectory was obtained for the time series signals of both types of H1N1 genomic sequences (pandemic and classical). The step that follows obtaining the phase-space trajectory is feature extraction.
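
The time-delay reconstruction described above can be sketched as follows, using the reported embedding dimension of 3 and delay of 4 as defaults. This is a minimal illustration rather than the TSTOOL routine; the function name `delay_embed` is hypothetical.

```python
def delay_embed(signal, dim=3, tau=4):
    """Return delay vectors (x[t], x[t + tau], ..., x[t + (dim - 1) * tau])."""
    n_vectors = len(signal) - (dim - 1) * tau
    if n_vectors <= 0:
        raise ValueError("signal too short for this dim and tau")
    return [
        tuple(signal[t + k * tau] for k in range(dim))
        for t in range(n_vectors)
    ]


# Toy series 0..11: the first delay vector picks samples 0, 4 and 8.
trajectory = delay_embed(list(range(12)))
```

A series of N samples gives N - (dim - 1) * tau points in the reconstructed phase space, so a 1000-sample signal yields 992 three-dimensional points with these settings.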

Feature extraction 

The TSTOOL software package was used to estimate the nonlinear dynamical features; it is a software package for signal processing with an emphasis on nonlinear time-series analysis.

Moment invariants 

Features obtained from moment invariants are simply calculated features that do not change under translation, scaling or rotation. These invariants are constructed using the generalized fundamental theorem of moment invariants (GFTMI). The n-dimensional moments of order p of an intensity function \rho(x_1, \ldots, x_n) = \rho(x) are defined in terms of the Riemann integral as:

(1)  m_{p_1 \ldots p_n} = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} x_1^{p_1} \cdots x_n^{p_n} \, \rho(x) \, dx_1 \cdots dx_n

where p_1 + \ldots + p_n = p and 0 < p < \infty. It is assumed that \rho(x) is a piecewise continuous, and therefore bounded, function that can have nonzero values only in a finite part of R^n; the moments of all orders then exist. The central moments are:

(2)  \mu_{p_1 \ldots p_n} = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} (x_1 - \bar{x}_1)^{p_1} \cdots (x_n - \bar{x}_n)^{p_n} \, \rho(x) \, dx_1 \cdots dx_n

where \bar{x}_k denotes the k-th coordinate of the centroid of \rho. The seven features of moment invariants are denoted φ1, φ2, φ3, φ4, φ5, φ7 and φ8.
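
For a reconstructed trajectory, the integrals in equations (1) and (2) can be approximated by sums over the trajectory points. The sketch below assumes that each point carries equal weight 1/N, which is an illustrative choice and not necessarily the discretization used in this work; the function names are hypothetical.

```python
from math import prod


def raw_moment(points, orders):
    """Discrete analogue of equation (1): mean of x_1^p_1 * ... * x_n^p_n."""
    return sum(
        prod(x ** p for x, p in zip(pt, orders)) for pt in points
    ) / len(points)


def centroid(points):
    """Coordinate-wise mean of the point cloud."""
    n = len(points)
    return tuple(sum(pt[k] for pt in points) / n for k in range(len(points[0])))


def central_moment(points, orders):
    """Discrete analogue of equation (2): moments taken about the centroid."""
    c = centroid(points)
    return sum(
        prod((x - cx) ** p for x, cx, p in zip(pt, c, orders)) for pt in points
    ) / len(points)


# Small 3-dimensional toy cloud and a translated copy of it.
cloud = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (2.0, 2.0, 0.0)]
shifted = [(x + 5.0, y - 3.0, z + 1.0) for x, y, z in cloud]
```

Because central moments are taken about the centroid, `central_moment(cloud, (2, 0, 0))` and `central_moment(shifted, (2, 0, 0))` agree, which is the translation invariance that the moment-invariant features build on.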

Largest Lyapunov exponent (LLE)

In this work, a set of genomic sequences from segment 8 of the influenza genome of both pandemic and classical H1N1 was downloaded from the NCBI. The length of these sequences was chosen to be 800-1000 bp. The sequences were first encoded using EIIP sequence indicators, and the phase-space trajectory was then reconstructed for each time series. The TSTOOL largelyap algorithm was used to estimate the Largest Lyapunov Exponent (LLE). This algorithm is similar to Wolf's algorithm and provides an efficient estimate of the LLE through the calculation of the rate of increase of the prediction error versus the prediction time.
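
The idea of reading the LLE from the growth of the distance between neighbouring orbits can be sketched as below. This is a simplified stand-in written for illustration, not the TSTOOL largelyap implementation; the function name, the `horizon` and `min_separation` parameters and the logistic-map demonstration are all assumptions.

```python
import math


def largest_lyapunov(points, horizon=5, min_separation=10):
    """Slope of the mean log neighbour distance versus prediction time."""
    n = len(points) - horizon
    pairs = []
    for i in range(n):
        best_j, best_d = None, math.inf
        for j in range(n):
            if abs(i - j) < min_separation:
                continue  # skip temporally close points on the same orbit
            d = math.dist(points[i], points[j])
            if 0.0 < d < best_d:
                best_j, best_d = j, d
        if best_j is not None:
            pairs.append((i, best_j))
    if not pairs:
        raise ValueError("no usable neighbour pairs")

    # Mean log distance of the paired orbits after k steps, k = 0..horizon.
    mean_log = []
    for k in range(horizon + 1):
        logs = []
        for i, j in pairs:
            d = math.dist(points[i + k], points[j + k])
            if d > 0.0:
                logs.append(math.log(d))
        mean_log.append(sum(logs) / len(logs))

    # Least-squares slope of mean_log against k (units: per time step).
    ks = list(range(horizon + 1))
    k_mean = sum(ks) / len(ks)
    y_mean = sum(mean_log) / len(mean_log)
    num = sum((k - k_mean) * (y - y_mean) for k, y in zip(ks, mean_log))
    den = sum((k - k_mean) ** 2 for k in ks)
    return num / den


# Demonstration on the logistic map x -> 4x(1 - x), whose LLE is ln 2.
x, orbit = 0.3, []
for _ in range(600):
    orbit.append((x,))
    x = 4.0 * x * (1.0 - x)
lle = largest_lyapunov(orbit)
```

Pairs at zero distance are skipped because their logarithm is undefined. Identical delay vectors can occur in a signal that takes only four distinct values, as an EIIP-encoded sequence does.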

Results 

Results of moment invariants 

Features based on moment invariants were computed after the construction of the phase space of both pandemic and classical H1N1 EIIP-encoded sequences. The seven features are denoted φ1, φ2, φ3, φ4, φ5, φ7 and φ8. A significance test (t-test) was performed on the proposed features to assess their use in discriminating between the two groups. The p-value was calculated for all seven features; all are less than 0.05, as shown in Table 1. Figure 1 compares the average features extracted based on moment invariants for pandemic and classical H1N1. There is a significant difference between the two types of H1N1, as shown in the figure. The small vertical bars represent the standard deviation across features.

Results of largest Lyapunov exponent (LLE)

The LLE estimates of a set of pandemic and classical H1N1 genomic sequences were calculated using the TSTOOL largelyap algorithm, as shown in Table 2. This algorithm is very similar to the Wolf algorithm; it computes the average exponential growth of the distance between neighboring orbits via the prediction error. The increase of the prediction error versus the prediction time allows an estimation of the Largest Lyapunov Exponent. A significance t-test was applied to assess the use of LLE estimates in discriminating between pandemic and classical H1N1.

Table 1. P-values of the t-test on a set of pandemic and classical H1N1 EIIP-encoded sequences for features extracted using moment invariants

Moment invariant feature    p-value
φ1                          1.8235e-004
φ2                          2.2912e-005
φ3                          1.3674e-010
φ4                          0.0012
φ5                          0.0288
φ7                          2.2912e-005
φ8                          6.7141e-005

Figure 1. Features extracted based on moment invariants for pandemic and classical H1N1; the small vertical bars represent standard deviations across features

Significance test 

The accuracy of discrimination between pandemic H1N1 and classical H1N1 using the moment invariant and Largest Lyapunov Exponent dynamical system features was evaluated. These features were divided into three feature vectors as follows:
V1 = {φ1, φ2, φ3, φ4, φ5, φ7, φ8}
V2 = {LLE}
V3 = {φ1, φ2, φ3, φ4, φ5, φ7, φ8, LLE}

Table 2. Largest Lyapunov Exponent estimates of pandemic and classical H1N1 encoded sequences

LLE (Pandemic H1N1)    LLE (Classical H1N1)
2.6932                 0.3428
2.7103                 0.3601
2.7113                 0.3628
2.7142                 0.3667
2.7153                 0.3795
2.7280                 0.3854
2.7392                 0.4429
2.7475                 0.4491
2.7506                 0.4533
2.7567                 0.5569
2.7577                 0.6274
2.7722                 0.7417
2.8972                 0.7509
2.9075                 0.8078
2.9178                 0.8891
2.9472                 0.9501
2.9508                 0.9677



The feature vectors were fed into the classification process using a K-means clustering classifier. The results are shown in Table 3.

Table 3. Accuracy of the proposed nonlinear pattern recognition method using the K-means classifier

                  V1       V2      V3
Pandemic H1N1     100%     100%    100%
Classical H1N1    70.8%    100%    100%
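
To illustrate how K-means clustering separates the two groups on a single feature, the sketch below clusters the LLE estimates listed in Table 2 (feature vector V2) into two groups. It is a minimal one-dimensional version written for illustration; the initialization at the smallest and largest value and the function name `kmeans_1d` are assumptions, not details taken from this work.

```python
# LLE estimates copied from Table 2.
pandemic = [2.6932, 2.7103, 2.7113, 2.7142, 2.7153, 2.7280, 2.7392, 2.7475,
            2.7506, 2.7567, 2.7577, 2.7722, 2.8972, 2.9075, 2.9178, 2.9472,
            2.9508]
classical = [0.3428, 0.3601, 0.3628, 0.3667, 0.3795, 0.3854, 0.4429, 0.4491,
             0.4533, 0.5569, 0.6274, 0.7417, 0.7509, 0.8078, 0.8891, 0.9501,
             0.9677]


def kmeans_1d(values, iterations=100):
    """Two-cluster k-means on scalars; returns (labels, centroids)."""
    # Assumption: start the two centroids at the smallest and largest value.
    centroids = [min(values), max(values)]
    labels = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        labels = [
            0 if abs(v - centroids[0]) <= abs(v - centroids[1]) else 1
            for v in values
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


labels, centroids = kmeans_1d(pandemic + classical)
```

With this initialization, cluster 0 starts at the lowest LLE value and cluster 1 at the highest, so comparing `labels` with the known group of each sequence gives a per-group accuracy of the kind reported in Table 3.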



Discussion 

The proposed techniques were implemented and applied to a number of EIIP-encoded sequences of pandemic and classical H1N1 from segment 8 of the influenza genome to identify their genomic signatures, as continuous detection of these signatures is important in the analysis of the adaptation process from nonhumans to humans.

As to the chaotic features extracted based on moment invariants, the seven features are denoted φ1, φ2, φ3, φ4, φ5, φ7 and φ8. Considering the p-values: if p < 0.05 there is a significant difference; if p > 0.05 there is no significant difference. The results show that these features have the potential to discriminate between pandemic and classical H1N1, as all of their p-values are < 0.05.

As to the chaotic features based on LLE estimates, the p-value of the t-test was calculated as 2.1546e-019, which is < 0.05. To validate this result, a random DNA sequence of length 1000 bp was generated; the Largest Lyapunov Exponent (LLE) of this random sequence was estimated at 1.4046 and compared with the average LLE estimates of pandemic H1N1 (2.8218) and classical H1N1 (0.4697). The results confirm that pandemic H1N1 genomic sequences can be statistically differentiated from classical H1N1 genomic sequences by LLE dynamical features.
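
The t statistic behind such a comparison can be sketched with the LLE estimates listed in Table 2. The text does not state which variant of the t-test was used, so the pooled-variance (Student) two-sample form below is an assumption, and the function name `pooled_t` is hypothetical.

```python
from math import sqrt
from statistics import mean, variance

# LLE estimates copied from Table 2.
pandemic = [2.6932, 2.7103, 2.7113, 2.7142, 2.7153, 2.7280, 2.7392, 2.7475,
            2.7506, 2.7567, 2.7577, 2.7722, 2.8972, 2.9075, 2.9178, 2.9472,
            2.9508]
classical = [0.3428, 0.3601, 0.3628, 0.3667, 0.3795, 0.3854, 0.4429, 0.4491,
             0.4533, 0.5569, 0.6274, 0.7417, 0.7509, 0.8078, 0.8891, 0.9501,
             0.9677]


def pooled_t(a, b):
    """Two-sample Student t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    # Pooled sample variance (assumes equal variances in both groups).
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    return t, na + nb - 2


t_stat, dof = pooled_t(pandemic, classical)
```

The p-value then follows from the t distribution with `dof` degrees of freedom, which a statistics library can supply. A large `t_stat` corresponds to a small p-value.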

Conclusion 

The analysis of the different genomic mutations of pandemic H1N1 genomic sequences is very important for studying the possibility of virus adaptation from non-humans to humans.

A study of the nonlinear dynamics of pandemic and classical H1N1 genomic sequences from segment 8 of the influenza genome was presented, with the aim of discriminating between them by their moment invariants and Largest Lyapunov Exponent (LLE) estimates. The results were supported by statistical analysis, which indicates that these nonlinear dynamical features can discriminate between the two types of H1N1 with high accuracy. The study suggests that such features open the door to extracting more patterns for use in monitoring and identifying H1N1 genomic signatures.